Linear Regression Visualization

Given a dataset $$\{ (x_i, y_i) \}_{i = 1}^{n},$$ a linear regression model assumes the relationship between the dependent variable $y$ and the vector of regressors $x$ is linear. However, in the real world these relationships are often noisy. So, given a linear regression model with parameters $\beta$, the relationship we are modeling is $$Y = X\beta + \epsilon$$ where $$X = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}^{T} \in \mathbb{R}^{n \times d}$$

$$Y = \begin{bmatrix} y_1 & y_2 & \dots & y_n \end{bmatrix}^{T} \in \mathbb{R}^{n}, \qquad \epsilon = \begin{bmatrix} \epsilon_1 & \epsilon_2 & \dots & \epsilon_n \end{bmatrix}^{T} \in \mathbb{R}^{n}$$

Here, $\epsilon$ corresponds to a noise term, and in the case of linear regression each component is assumed i.i.d. Gaussian, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$; in other words, the error is assumed to be zero-mean (unbiased) and normally distributed.

Fitting a linear model usually amounts to minimizing the error of the model: $$\begin{align} Y &= X\beta + \epsilon \\ Y - X\beta &= \epsilon \end{align}$$

More formally, linear regression can be formulated as follows:

$$\min_{\beta}\|\epsilon\|_{2}^{2} = \min_{\beta}\|Y - X\beta\|_{2}^{2}$$

We can use calculus to solve for the $\beta$ that minimizes this objective.

$$\begin{align} \|Y - X\beta\|_{2}^{2} &= \beta^{T}X^{T}X\beta - 2Y^{T}X\beta + Y^{T}Y \\ \nabla_{\beta} \|Y - X\beta\|_{2}^{2} &= 2X^{T}X\beta - 2X^{T}Y = 0 \\ X^{T}X\beta &= X^{T}Y \\ \beta &= (X^{T}X)^{-1}X^{T}Y \end{align}$$ where the last step assumes $X^{T}X$ is invertible, i.e. $X$ has full column rank.
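As a quick sanity check of the closed form, here is a minimal sketch with synthetic data (the sizes and the true $\beta$ are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 100 points, d = 3 regressors, known true beta.
n, d = 100, 3
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta = (X^T X)^{-1} X^T Y, solved without forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Agrees with NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat, beta_lstsq)
```

Using `np.linalg.solve` on the normal equations avoids explicitly inverting $X^{T}X$, which is both cheaper and numerically safer.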

Demonstration of single variable linear regression
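A minimal sketch of such a demonstration, assuming synthetic data with a true slope of 3 and intercept of 2 (these values are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic single-variable data: y = 3x + 2 plus Gaussian noise.
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)

# Append a column of ones so the intercept is learned as a second coefficient.
X = np.column_stack([x, np.ones_like(x)])

# Closed-form fit via the normal equations.
slope, intercept = np.linalg.solve(X.T @ X, X.T @ y)
print(slope, intercept)  # near 3 and 2
```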

Using Gradient Descent to perform linear regression

Even though we have an explicit formula for the optimal parameters $\beta$ given a set of datapoints $\{ (x_i, y_i) \}_{i = 1}^{n}$, we can also use gradient descent to solve for an optimal value of $\beta$. Let's demonstrate how, using Stochastic Gradient Descent (SGD) as our gradient descent algorithm of choice. (Strictly speaking, the update derived below uses the full-batch gradient; SGD would estimate it from a random subset of the data at each step.)

Recall the formulation of Linear Regression:

$$\min_{\beta}\|\epsilon\|_{2}^{2} = \min_{\beta}\|Y - X\beta\|_{2}^{2} = \min_{\beta}L(\beta)$$

Now we can use calculus to solve for $\nabla_{\beta} L(\beta)$.

$$\begin{align} L(\beta) &= \beta^{T}X^{T}X\beta - 2Y^{T}X\beta + Y^{T}Y \\ \nabla_{\beta}L(\beta) &= 2X^{T}X\beta - 2X^{T}Y \end{align}$$
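One way to sanity-check this gradient is against a central finite difference of $L(\beta)$; a quick sketch (all data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Random illustrative data.
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
beta = rng.normal(size=3)

def loss(b):
    r = Y - X @ b
    return r @ r                       # ||Y - Xb||^2

def grad(b):
    return 2 * X.T @ X @ b - 2 * X.T @ Y

# Central finite difference along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(numeric - grad(beta))))  # should be tiny
```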

With an explicit form for $\nabla_{\beta} L(\beta)$ we can now use SGD to optimize $\beta$.

Recall the update rule for SGD with learning rate $\lambda$ at timestep $t$:

$$\beta_{t} = \beta_{t - 1} - \lambda \nabla_{\beta}L(\beta_{t-1})$$

Substituting for $\nabla_{\beta} L(\beta)$, we now have an explicit update rule to optimize $\beta$:

$$\beta_{t} = \beta_{t - 1} - \lambda \left(2X^{T}X\beta_{t-1} - 2X^{T}Y\right)$$

Now, we can optimize $\beta$ with a random initialization for $\beta_{0}$.
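Putting it together, here is a minimal sketch of the loop, assuming full-batch gradients, synthetic data, and an arbitrary learning rate and iteration count:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data with a known true beta.
n, d = 100, 3
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta = rng.normal(size=d)   # random initialization beta_0
lam = 1e-3                  # learning rate lambda

# Repeatedly apply the update rule beta_t = beta_{t-1} - lam * grad.
for _ in range(1000):
    grad = 2 * X.T @ X @ beta - 2 * X.T @ Y
    beta = beta - lam * grad

print(beta)  # approaches the closed-form solution
```

The learning rate matters: too large and the iterates diverge, too small and convergence is slow; the value above is just a choice that works for this synthetic problem.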

Enjoy :)